Abstract

In this paper we focus on the problem of estimating a bounded density using
a finite combination of densities from a given class. We consider the Maximum
Likelihood Procedure (MLE) and the greedy procedure described by Li and Barron
[6, 7]. Approximation and estimation bounds are given for the above methods. We
extend and improve upon the estimation results of Li and Barron, and in particular
prove an $O(1/\sqrt{n})$ bound on the estimation error which does not depend on the number
of densities in the estimated combination.
1 Introduction

In the density estimation problem, we are given $n$ i.i.d. samples $S = \{x_1, \ldots, x_n\}$ drawn from an
unknown density $f$. The goal is to estimate this density from the given data. We consider the
Maximum Likelihood Procedure (MLE) and the greedy procedure described by Li and Barron [6,
7] and prove estimation bounds for these procedures. Rates of convergence for density estimation
were studied in [3, 10, 11, 13]. For neural networks and projection pursuit, approximation and
estimation bounds can be found in [1, 2, 4, 9].

To evaluate the accuracy of the density estimate we need a notion of distance. Kullback-Leibler
(KL) divergence and Hellinger distance are the most commonly used. Li and Barron [6, 7]
give final bounds in terms of KL-divergence, and since our paper extends and improves upon
their results, we will be using this notion of distance as well. The KL-divergence between two
distributions is defined as

\[ D(f\|g) = \int f(x) \log\frac{f(x)}{g(x)}\,dx = \mathbb{E}\log\frac{f}{g}. \]

The expectation here is assumed to be with respect to $x$, which comes from a distribution with
the density $f(x)$.
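
As a concrete illustration of the definition, the following minimal Python sketch approximates $D(f\|g)$ by a midpoint Riemann sum for two bounded densities on $[0, 1]$. The helper name `kl_divergence`, the two example densities and the grid size are illustrative assumptions and do not come from the paper.

```python
import math

def kl_divergence(f, g, lo=0.0, hi=1.0, steps=100_000):
    """Midpoint-rule approximation of the integral of f*log(f/g) over [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for j in range(steps):
        x = lo + (j + 0.5) * width
        fx, gx = f(x), g(x)
        total += fx * math.log(fx / gx) * width
    return total

# Two densities on [0, 1], both bounded away from zero (hypothetical examples).
uniform = lambda x: 1.0
tilted = lambda x: 0.5 + x   # integrates to 1 on [0, 1], takes values in [0.5, 1.5]

kl_same = kl_divergence(uniform, uniform)      # zero: a density has no divergence from itself
kl_forward = kl_divergence(uniform, tilted)    # D(uniform || tilted)
kl_backward = kl_divergence(tilted, uniform)   # D(tilted || uniform); KL is not symmetric
```

Because both example densities are bounded below, the logarithm stays finite everywhere, which is the situation assumed throughout the paper.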

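Consider a parametric family of probability density functions
$\mathcal{H} = \{\phi_\theta(x) : \theta \in \Theta \subset \mathbb{R}^d\}$. The
class of $k$-component mixtures $f_k$ is defined as

\[ \mathcal{C}_k = \mathrm{conv}_k(\mathcal{H}) = \Big\{ f : f(x) = \sum_{i=1}^{k} \lambda_i \phi_{\theta_i}(x),\ \sum_{i=1}^{k} \lambda_i = 1,\ \theta_i \in \Theta \Big\}. \]

Approximation results will depend on the following class of continuous convex combinations
(with respect to all measures $P$ on $\Theta$)

\[ \mathcal{C} = \mathrm{conv}(\mathcal{H}) = \Big\{ f : f(x) = \int_\Theta \phi_\theta(x)\,P(d\theta) \Big\}. \]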

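The approximation bound of Li and Barron [6, 7] states that for any $f$, there exists a
$k$-component mixture $f_k \in \mathcal{C}_k$ such that

\[ D(f\|f_k) \le D(f\|\mathcal{C}) + \frac{c_{f,P}^2\,\gamma}{k}, \qquad (1) \]

where $c_{f,P}$ and $\gamma$ are constants and $D(f\|\mathcal{C}) = \inf_{g\in\mathcal{C}} D(f\|g)$. Furthermore, $\gamma$ upper-bounds the
log-ratio of any two functions $\phi_\theta(x)$, $\phi_{\theta'}(x)$ for all $\theta, \theta', x$, and therefore

\[ \sup_{\theta,\theta',x} \log\frac{\phi_\theta(x)}{\phi_{\theta'}(x)} < \infty \qquad (2) \]

is a condition on the class $\mathcal{H}$.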

Li and Barron prove that $k$-mixture approximations satisfying (1) can be constructed by the
following greedy procedure: initialize $f_1 = \phi_\theta$ to minimize $D(f\|f_1)$, and at step $k$ construct $f_k$
from $f_{k-1}$ by finding $\alpha$ and $\theta$ such that

\[ D(f\|f_k) \le \min_{\alpha,\theta} D\big(f \,\|\, (1-\alpha) f_{k-1}(x) + \alpha\phi_\theta(x)\big). \]

Furthermore, a connection between KL-divergence and Maximum Likelihood suggests the following
method to compute the estimate $\hat f_k$ from the data by greedily choosing $\phi_\theta$ at step $k$ so
that

\[ \sum_{i=1}^{n} \log \hat f_k(x_i) \ge \max_{\alpha,\theta} \sum_{i=1}^{n} \log\big[(1-\alpha)\hat f_{k-1}(x_i) + \alpha\phi_\theta(x_i)\big]. \qquad (3) \]
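
The likelihood-based greedy step (3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Li and Barron's implementation: the maximization over $\theta$ and $\alpha$ is replaced by a search over finite grids, the base class is a hypothetical family of cosine-shaped densities on $[0, 1]$, the data are placeholder uniform draws, and all function names are invented for the example.

```python
import math
import random

def phi(theta, x):
    """Hypothetical base density on [0, 1], bounded between 0.5 and 1.5."""
    return 1.0 + 0.5 * math.cos(2.0 * math.pi * (x - theta))

def log_likelihood(values):
    return sum(math.log(v) for v in values)

def greedy_mixture(sample, k, thetas, alphas):
    """Run k greedy steps; return [(weight, theta), ...] and the log-likelihood after each step."""
    # Step 1: the single component with the largest likelihood.
    best_theta = max(thetas, key=lambda th: log_likelihood([phi(th, x) for x in sample]))
    components = [(1.0, best_theta)]
    fitted = [phi(best_theta, x) for x in sample]   # current estimate evaluated on the sample
    history = [log_likelihood(fitted)]
    for _ in range(2, k + 1):
        best = None
        for th in thetas:
            new_vals = [phi(th, x) for x in sample]
            for al in alphas:
                mixed = [(1 - al) * f + al * p for f, p in zip(fitted, new_vals)]
                ll = log_likelihood(mixed)
                if best is None or ll > best[0]:
                    best = (ll, al, th, mixed)
        ll, al, th, fitted = best
        # Shrink the old weights by (1 - alpha) and append the new component with weight alpha.
        components = [(w * (1 - al), t) for w, t in components] + [(al, th)]
        history.append(ll)
    return components, history

random.seed(0)
sample = [random.random() for _ in range(200)]    # placeholder data
thetas = [j / 20 for j in range(20)]
alphas = [j / 10 for j in range(11)]              # includes alpha = 0
components, history = greedy_mixture(sample, k=4, thetas=thetas, alphas=alphas)
```

Since the grid of $\alpha$ values contains 0, each step can always keep the previous estimate, so the log-likelihood recorded in `history` never decreases.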

Li  and  Barron  proved  the  following  theorem: 

Theorem 1.1. Let $\hat f_k(x)$ be either the maximizer of the likelihood over $k$-component mixtures or
more generally any sequence of density estimates satisfying (3). Assume additionally that $\Theta$ is
a $d$-dimensional cube with side-length $A$, and that

\[ \sup_x \big|\log\phi_\theta(x) - \log\phi_{\theta'}(x)\big| \le B\sum_{j=1}^{d} |\theta_j - \theta'_j| \qquad (4) \]

for any $\theta, \theta' \in \Theta$. Then

\[ \mathbb{E}_S\big[D(f\|\hat f_k)\big] - D(f\|\mathcal{C}) \le \frac{c_1}{k} + \frac{c_2 k}{n}\log(n c_3), \qquad (5) \]

where $c_1, c_2, c_3$ are constants (dependent on $A, B, d$).

Here $\mathbb{E}_S$ denotes the expectation with respect to a draw of $n$ independent points according to the
unknown distribution $f$. The above bound combines the approximation and estimation results.
Note that the first term decreases with the number of components $k$, while the second term
increases. The rate of convergence for the optimal $k$ is therefore $O\big(\sqrt{\log n / n}\big)$.
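
To see where the rate for the optimal $k$ comes from, one can balance the two terms of (5), treating $k$ as a continuous variable. The following short calculation is a sketch of that balance; $k^*$ is notation introduced here for the balancing value.

```latex
\[
\frac{c_1}{k} = \frac{c_2 k}{n}\log(n c_3)
\quad\Longrightarrow\quad
k^* = \sqrt{\frac{c_1 n}{c_2 \log(n c_3)}},
\qquad
\frac{c_1}{k^*} + \frac{c_2 k^*}{n}\log(n c_3)
= 2\sqrt{\frac{c_1 c_2 \log(n c_3)}{n}}
= O\!\left(\sqrt{\frac{\log n}{n}}\right).
\]
```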


2  Main  Results 

Instead of condition (2), we assume that class $\mathcal{H}$ consists of functions bounded below and above
by $a$ and $b$, respectively. See the discussion section for the comparison of these two assumptions.
We prove the following results:

Theorem 2.1. For any target density $f$ such that $a \le f \le b$ and $\hat f_k(x)$ either the maximizer
of the likelihood over $k$-component mixtures or more generally any sequence of density estimates
satisfying (3),

\[ \mathbb{E}_S\big[D(f\|\hat f_k)\big] - D(f\|\mathcal{C}) \le \frac{c_1}{k} + \mathbb{E}_S\Big[\frac{c_2}{\sqrt{n}}\int_0^b \log^{1/2}\mathcal{D}(\mathcal{H},\epsilon,d_x)\,d\epsilon\Big], \]

where $c_1, c_2$ are constants (dependent on $a, b$) and $\mathcal{D}(\mathcal{H},\epsilon,d_x)$ is the covering number of $\mathcal{H}$ at
scale $\epsilon$ with respect to empirical distance $d_x$.

Corollary 2.1. Under the conditions of Theorem 1.1 (i.e. $\mathcal{H}$ satisfying condition (4) and $\Theta$
being a cube with side-length $A$), the bound of Theorem 2.1 becomes

\[ \mathbb{E}_S\big[D(f\|\hat f_k)\big] - D(f\|\mathcal{C}) \le \frac{c_1}{k} + \frac{c_2}{\sqrt{n}}, \]

where $c_1$ and $c_2$ are constants (dependent on $a, b, A, B, d$).
3 Discussion of the Results

The result of Theorem 2.1 is twofold. The first implication concerns dependence of the bound on
$k$, the number of components. Our results show that there is an estimation bound of the order
$O(1/\sqrt{n})$ that does not depend on $k$. Therefore, the number of components is not a trade-off that
has to be made with the approximation part.

The second implication concerns the rate of convergence in terms of $n$, the number of samples.
The rate of convergence (in the sense of KL-divergence) of the estimated mixture to the true
density is of the order $O(1/\sqrt{n})$. As Corollary 2.1 shows, for the specific class $\mathcal{H}$ considered by
Li and Barron, the Dudley integral converges and does not depend on $n$. Furthermore, the result
of this paper holds for general base classes with a converging entropy integral, extending and
improving the result of Li and Barron. Note that the bound of Theorem 2.1 is in terms of the
metric entropy of $\mathcal{H}$, as opposed to the metric entropy of $\mathcal{C}$. This is a strong result because the
convex class $\mathcal{C}$ can be very large [8] even for small $\mathcal{H}$.

Rates of convergence for the MLE in mixture models were recently studied by Sara van de Geer
[10]. As the author notes, the optimality of the rates depends primarily on the optimality of
the entropy calculations. Unfortunately, in the results of [10], the entropy of the convex class
appears in the bounds, which is undesirable. Moreover, only finite combinations are considered.
Wong and Shen [13] also considered density estimation, giving rates of convergence in Hellinger
distance for a class of bounded Lipschitz densities. In their work, again, a bound on the metric
entropy of the whole class is used and the rates of convergence are slower than those achieved in
this paper.

An advantage of the approach of [10] is the use of Hellinger distance to avoid problems near zero.
Li and Barron address this problem by requiring (2), which is boundedness of the log of the ratio
of two densities. We address this problem by assuming boundedness of the densities directly. The
two conditions are equivalent unless we consider classes consisting only of unbounded functions
or consisting only of functions approaching 0 at the same rate (in which case condition (2) is
weaker). If the boundedness of densities is assumed, as [3] notes, the KL-divergence and the
Hellinger distance do not differ by more than a multiplicative constant.

4  Proofs 

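Assume $0 < a \le \phi_\theta \le b$ for all $\phi_\theta \in \mathcal{H}$. Constants which depend only on $a$ and $b$ will be denoted
by $c$ with various subscripts. The values of the constants might change from line to line.

Theorem 4.1. For any fixed $f$, $0 < a \le f \le b$, and $S = \{x_1, \ldots, x_n\}$ drawn i.i.d. from $f$, with
probability at least $1 - e^{-t}$,

\[ \sup_{h\in\mathcal{C}} \Big|\frac{1}{n}\sum_{i=1}^{n}\log\frac{h(x_i)}{f(x_i)} - \mathbb{E}\log\frac{h}{f}\Big| \le \mathbb{E}_S\Big[\frac{c_1}{\sqrt{n}}\int_0^b \log^{1/2}\mathcal{D}(\mathcal{H},\epsilon,d_x)\,d\epsilon\Big] + c_2\sqrt{\frac{t}{n}}, \]

where $c_1$ and $c_2$ are constants that depend on $a$ and $b$.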


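Proof. By Lemma A.3,

\[ \sup_{h\in\mathcal{C}} \Big|\frac{1}{n}\sum_{i=1}^{n}\log\frac{h(x_i)}{f(x_i)} - \mathbb{E}\log\frac{h}{f}\Big| \le \mathbb{E}_S\sup_{h\in\mathcal{C}} \Big|\frac{1}{n}\sum_{i=1}^{n}\log\frac{h(x_i)}{f(x_i)} - \mathbb{E}\log\frac{h}{f}\Big| + 2\sqrt{2}\,\log\frac{b}{a}\sqrt{\frac{t}{n}} \]

with probability at least $1 - e^{-t}$, and by Lemma A.2,

\[ \mathbb{E}_S\sup_{h\in\mathcal{C}} \Big|\frac{1}{n}\sum_{i=1}^{n}\log\frac{h(x_i)}{f(x_i)} - \mathbb{E}\log\frac{h}{f}\Big| \le 2\,\mathbb{E}_{S,\epsilon}\sup_{h\in\mathcal{C}} \Big|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\log\frac{h(x_i)}{f(x_i)}\Big|. \]

Combining,

\[ \sup_{h\in\mathcal{C}} \Big|\frac{1}{n}\sum_{i=1}^{n}\log\frac{h(x_i)}{f(x_i)} - \mathbb{E}\log\frac{h}{f}\Big| \le 2\,\mathbb{E}_{S,\epsilon}\sup_{h\in\mathcal{C}} \Big|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\log\frac{h(x_i)}{f(x_i)}\Big| + 2\sqrt{2}\,\log\frac{b}{a}\sqrt{\frac{t}{n}} \]

with probability at least $1 - e^{-t}$.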

Therefore, instead of bounding the difference between the "empirical" and the "expectation",
it is enough to bound the above expectation of the Rademacher average. This is a simpler
task, but first we have to deal with the log and the fraction (over $f$) in the Rademacher sum. To
eliminate these difficulties, we apply Lemma A.1 twice. Once we reduce our problem to bounding
the Rademacher sum $\sup_{\phi\in\mathcal{H}} \big|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\phi(x_i)\big|$ of the basis functions, we will be able to use the
entropy of the class $\mathcal{H}$.

Let $p_i = \frac{h(x_i)}{f(x_i)} - 1$ and note that $\frac{a}{b} - 1 \le p_i \le \frac{b}{a} - 1$. Consider $\phi(p_i) = \log(1 + p_i)$. The largest
derivative of $\log(1 + p)$ on the interval $p \in [\frac{a}{b} - 1, \frac{b}{a} - 1]$ is at $p = a/b - 1$ and is equal to $b/a$. So,
$\frac{a}{b}\log(p + 1)$ is 1-Lipschitz. Also, $\phi(0) = 0$. By Lemma A.1 applied to $\phi(p_i)$ and $G$ being the identity
mapping,


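\[
\begin{aligned}
2\,\mathbb{E}_{S,\epsilon}\sup_{h\in\mathcal{C}} \Big|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\log\frac{h(x_i)}{f(x_i)}\Big|
&= 2\,\mathbb{E}_{S,\epsilon}\sup_{h\in\mathcal{C}} \Big|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\phi(p_i)\Big| \\
&\le 2\frac{b}{a}\,\mathbb{E}_{S,\epsilon}\sup_{h\in\mathcal{C}} \Big|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\frac{h(x_i)}{f(x_i)} - \frac{1}{n}\sum_{i=1}^{n}\epsilon_i\Big| \\
&\le 2\frac{b}{a}\,\mathbb{E}_{S,\epsilon}\sup_{h\in\mathcal{C}} \Big|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\frac{h(x_i)}{f(x_i)}\Big| + 2\frac{b}{a}\,\mathbb{E}_{\epsilon}\Big|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\Big| \\
&\le 2\frac{b}{a}\,\mathbb{E}_{S,\epsilon}\sup_{h\in\mathcal{C}} \Big|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\frac{h(x_i)}{f(x_i)}\Big| + 2\frac{b}{a}\frac{1}{\sqrt{n}}.
\end{aligned}
\]

The last inequality holds trivially by upper-bounding the $L_1$ norm by the $L_2$ norm. Now apply Lemma A.1
again with the contraction $\phi_i(h_i) = a\,h_i / f_i$:

\[ |\phi_i(h_i) - \phi_i(g_i)| = \frac{a}{|f_i|}\,|h_i - g_i| \le |h_i - g_i|, \]

so that

\[ 2\frac{b}{a}\,\mathbb{E}_{S,\epsilon}\sup_{h\in\mathcal{C}} \Big|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\frac{h(x_i)}{f(x_i)}\Big| \le 2\frac{b}{a^2}\,\mathbb{E}_{S,\epsilon}\sup_{h\in\mathcal{C}} \Big|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i h(x_i)\Big|. \]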


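Combining the inequalities, with probability at least $1 - e^{-t}$,

\[ \sup_{h\in\mathcal{C}} \Big|\frac{1}{n}\sum_{i=1}^{n}\log\frac{h(x_i)}{f(x_i)} - \mathbb{E}\log\frac{h}{f}\Big| \le \frac{2b}{a^2}\,\mathbb{E}_{S,\epsilon}\sup_{h\in\mathcal{C}} \Big|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i h(x_i)\Big| + \sqrt{8}\,\log\frac{b}{a}\sqrt{\frac{t}{n}} + \frac{2b}{a}\frac{1}{\sqrt{n}}. \]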
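The power of using Rademacher averages to estimate complexity comes from the fact that
the Rademacher averages of a class are equal to those of the convex hull. Indeed, consider
$\sup_{h\in\mathcal{C}} \big|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i h(x_i)\big|$ with $h(x) = \int_\Theta \phi_\theta(x)\,P(d\theta)$. Since a linear functional of convex
combinations achieves its maximum value at the vertices, the above supremum is equal to

\[ \sup_{\theta} \Big|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\phi_\theta(x_i)\Big|, \]

the corresponding supremum on the basis functions $\phi_\theta$. Therefore,

\[ \mathbb{E}_{\epsilon}\sup_{h\in\mathcal{C}} \Big|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i h(x_i)\Big| = \mathbb{E}_{\epsilon}\sup_{\theta\in\Theta} \Big|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\phi_\theta(x_i)\Big|. \]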
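
The vertex argument can be checked numerically in the finite case. The sketch below is an illustration under simplifying assumptions: the base class is replaced by a hypothetical finite set of eight functions evaluated on thirty sample points (random placeholder numbers), and the convex hull is explored with randomly drawn convex weights. The names `rademacher_sum`, `vertex_sup` and `hull_sup` are invented for the example.

```python
import random

def rademacher_sum(signs, values):
    """Absolute value of (1/n) * sum_i eps_i * h(x_i)."""
    n = len(signs)
    return abs(sum(e * v for e, v in zip(signs, values)) / n)

random.seed(1)
n, num_basis = 30, 8
# basis[j][i] plays the role of phi_{theta_j}(x_i); the numbers are arbitrary placeholders.
basis = [[random.uniform(0.5, 1.5) for _ in range(n)] for _ in range(num_basis)]
signs = [random.choice((-1, 1)) for _ in range(n)]

# Supremum over the basis functions (the vertices of the hull).
vertex_sup = max(rademacher_sum(signs, phi) for phi in basis)

# Largest value found among randomly drawn convex combinations.
hull_sup = 0.0
for _ in range(2000):
    raw = [random.random() for _ in range(num_basis)]
    total = sum(raw)
    weights = [r / total for r in raw]   # a random point of the simplex
    mixture = [sum(w * phi[i] for w, phi in zip(weights, basis)) for i in range(n)]
    hull_sup = max(hull_sup, rademacher_sum(signs, mixture))
```

No convex combination exceeds the best vertex, which is the finite-dimensional version of the equality between the two suprema.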


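Next, we use the following classical result [12],

\[ \mathbb{E}_{\epsilon}\sup_{\phi\in\mathcal{H}} \Big|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\phi(x_i)\Big| \le \frac{c_1}{\sqrt{n}}\int_0^b \log^{1/2}\mathcal{D}(\mathcal{H},\epsilon,d_x)\,d\epsilon, \]

where $d_x$ is the empirical distance with respect to the set $S$.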

Putting it all together, the following holds with probability at least $1 - e^{-t}$:

\[ \sup_{h\in\mathcal{C}} \Big|\frac{1}{n}\sum_{i=1}^{n}\log\frac{h(x_i)}{f(x_i)} - \mathbb{E}\log\frac{h}{f}\Big| \le \mathbb{E}_S\Big[\frac{c_1}{\sqrt{n}}\int_0^b \log^{1/2}\mathcal{D}(\mathcal{H},\epsilon,d_x)\,d\epsilon\Big] + c_2\sqrt{\frac{t}{n}}. \]

If $\mathcal{H}$ is a VC-subgraph class with VC dimension $V$, the Dudley integral above is bounded by $c\sqrt{V}$ and
we get convergence. One example of such a class is worked out in the Appendix (Gaussian
densities over a bounded domain and with bounded variance). Another example is the class
considered in [6], and the cover is computed for it in the proof of Corollary 2.1. □


We  are  now  ready  to  prove  Theorem  2.1: 


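Proof. Write $\hat f_k$ for the estimate constructed from the data and $f_k$ for the approximating mixture
satisfying (1). Then

\[
\begin{aligned}
D(f\|\hat f_k) - D(f\|f_k)
&= \mathbb{E}\log\frac{f}{\hat f_k} - \mathbb{E}\log\frac{f}{f_k} \\
&= \Big(\mathbb{E}\log\frac{f}{\hat f_k} - \frac{1}{n}\sum_{i=1}^{n}\log\frac{f(x_i)}{\hat f_k(x_i)}\Big)
 + \Big(\frac{1}{n}\sum_{i=1}^{n}\log\frac{f(x_i)}{f_k(x_i)} - \mathbb{E}\log\frac{f}{f_k}\Big) \\
&\quad + \Big(\frac{1}{n}\sum_{i=1}^{n}\log\frac{f(x_i)}{\hat f_k(x_i)} - \frac{1}{n}\sum_{i=1}^{n}\log\frac{f(x_i)}{f_k(x_i)}\Big) \\
&\le 2\sup_{h\in\mathcal{C}} \Big|\frac{1}{n}\sum_{i=1}^{n}\log\frac{h(x_i)}{f(x_i)} - \mathbb{E}\log\frac{h}{f}\Big| + \frac{1}{n}\sum_{i=1}^{n}\log\frac{f_k(x_i)}{\hat f_k(x_i)} \\
&\le \mathbb{E}_S\Big[\frac{c_1}{\sqrt{n}}\int_0^b \log^{1/2}\mathcal{D}(\mathcal{H},\epsilon,d_x)\,d\epsilon\Big] + c_2\sqrt{\frac{t}{n}} + \frac{1}{n}\sum_{i=1}^{n}\log\frac{f_k(x_i)}{\hat f_k(x_i)}
\end{aligned}
\]

with probability at least $1 - e^{-t}$ (by Theorem 4.1). Note that $\frac{1}{n}\sum_{i=1}^{n}\log\frac{f_k(x_i)}{\hat f_k(x_i)} \le 0$ if $\hat f_k$ is
constructed by maximizing likelihood over $k$-component mixtures. If it is constructed by the greedy
algorithm described in the previous section, $\hat f_k$ achieves "almost maximum likelihood" ([7])
in the following sense: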


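\[ \forall g \in \mathcal{C}, \quad \frac{1}{n}\sum_{i=1}^{n}\log \hat f_k(x_i) \ge \frac{1}{n}\sum_{i=1}^{n}\log g(x_i) - \gamma\,\frac{c^2_{F_n,P}}{k}. \]

Here $c^2_{F_n,P} = \frac{1}{n}\sum_{i=1}^{n} \frac{\int \phi_\theta^2(x_i)\,P(d\theta)}{\left(\int \phi_\theta(x_i)\,P(d\theta)\right)^2}$ and $\gamma = 4\log(3\sqrt{e}) + 4\log\frac{b}{a}$. Hence, with probability
at least $1 - e^{-t}$,

\[ D(f\|\hat f_k) - D(f\|f_k) \le \mathbb{E}_S\Big[\frac{c_1}{\sqrt{n}}\int_0^b \log^{1/2}\mathcal{D}(\mathcal{H},\epsilon,d_x)\,d\epsilon\Big] + c_2\sqrt{\frac{t}{n}} + \frac{c_3}{k}. \]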


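We now write the overall error of estimating an unknown density $f$ as the sum of approximation
and estimation errors. The former is bounded by (1) and the latter is bounded as above. Note
again that $c^2_{f,P}$ and $\gamma$ in the approximation bound (1) are bounded above by constants which
depend only on $a$ and $b$. Therefore, with probability at least $1 - e^{-t}$ (with $\hat f_k$ the estimate and
$f_k$ the approximating mixture from (1)),

\[
\begin{aligned}
D(f\|\hat f_k) - D(f\|\mathcal{C}) &= \big(D(f\|f_k) - D(f\|\mathcal{C})\big) + \big(D(f\|\hat f_k) - D(f\|f_k)\big) \\
&\le \frac{c}{k} + \mathbb{E}_S\Big[\frac{c_1}{\sqrt{n}}\int_0^b \log^{1/2}\mathcal{D}(\mathcal{H},\epsilon,d_x)\,d\epsilon\Big] + c_2\sqrt{\frac{t}{n}}.
\end{aligned}
\]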

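Finally, we rewrite the above probabilistic statement as a statement in terms of expectations.
Let $\zeta = \frac{c}{k} + \mathbb{E}_S\big[\frac{c_1}{\sqrt{n}}\int_0^b \log^{1/2}\mathcal{D}(\mathcal{H},\epsilon,d_x)\,d\epsilon\big]$ and $\xi = D(f\|\hat f_k) - D(f\|\mathcal{C})$. We have shown that

\[ \mathbb{P}\Big(\xi \ge \zeta + c_2\sqrt{\frac{t}{n}}\Big) \le e^{-t}. \]

Since $\xi \ge 0$,

\[
\begin{aligned}
\mathbb{E}_S[\xi] = \int_0^\infty \mathbb{P}(\xi > u)\,du
&= \int_0^\zeta \mathbb{P}(\xi > u)\,du + \int_\zeta^\infty \mathbb{P}(\xi > u)\,du \\
&\le \zeta + \int_0^\infty \mathbb{P}(\xi > u + \zeta)\,du.
\end{aligned}
\]

Now set $u = c_2\sqrt{t/n}$. Then $t = c_3 n u^2$ and

\[ \mathbb{E}_S[\xi] \le \zeta + \int_0^\infty e^{-c_3 n u^2}\,du \le \zeta + \frac{c}{\sqrt{n}}. \]

Hence,

\[ \mathbb{E}_S\big[D(f\|\hat f_k)\big] - D(f\|\mathcal{C}) \le \frac{c_1}{k} + \mathbb{E}_S\Big[\frac{c_2}{\sqrt{n}}\int_0^b \log^{1/2}\mathcal{D}(\mathcal{H},\epsilon,d_x)\,d\epsilon\Big]. \]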


□ 
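
The tail integral in the last step of the proof is a Gaussian integral and can be evaluated in closed form. The short calculation below is a sketch that makes the resulting $1/\sqrt{n}$ term explicit; it writes $\xi$ for the excess risk $D(f\|\hat f_k) - D(f\|\mathcal{C})$ and $\zeta$ for the part of the bound that does not depend on $t$, and uses $c_3 = 1/c_2^2$, which follows from the substitution $u = c_2\sqrt{t/n}$.

```latex
\[
\int_0^\infty \mathbb{P}(\xi > u + \zeta)\,du
\le \int_0^\infty e^{-c_3 n u^2}\,du
= \frac{1}{2}\sqrt{\frac{\pi}{c_3 n}}
= \frac{c_2\sqrt{\pi}}{2\sqrt{n}} .
\]
```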
Remark 4.1. In the actual proof of the bounds, Li and Barron [7, 6] use a specific sequence of
$\alpha_i$ for the finite combinations. The authors take $\alpha_1 = 1$, $\alpha_2 = \frac{1}{2}$ and $\alpha_k = \frac{2}{k}$ for $k > 2$. It can
be shown that with these weights, writing $\phi_m$ for the component chosen at step $m$,

\[ f_k = \frac{2}{k(k-1)}\Big(\frac{1}{2}\phi_1 + \frac{1}{2}\phi_2 + \sum_{m=3}^{k}(m-1)\,\phi_m\Big), \]

so the later choices have more weight.
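
The effect of this step-size sequence on the final mixture weights can be verified with exact rational arithmetic. The sketch below assumes the step sizes $\alpha_1 = 1$, $\alpha_2 = 1/2$ and $\alpha_m = 2/m$ for $m > 2$, applies the update $f_m = (1 - \alpha_m) f_{m-1} + \alpha_m \phi_m$, and compares the resulting weights with the closed form $1/(k(k-1))$ for the first two components and $2(m-1)/(k(k-1))$ for component $m \ge 3$. The function names are illustrative.

```python
from fractions import Fraction

def greedy_weights(k):
    """Weights of phi_1..phi_k after k steps with alpha_1 = 1, alpha_2 = 1/2, alpha_m = 2/m."""
    weights = []
    for m in range(1, k + 1):
        alpha = Fraction(1) if m == 1 else Fraction(1, 2) if m == 2 else Fraction(2, m)
        # Old components are scaled by (1 - alpha); the new one enters with weight alpha.
        weights = [w * (1 - alpha) for w in weights] + [alpha]
    return weights

def closed_form(k):
    """Closed form for k >= 2: 1/(k(k-1)) twice, then 2(m-1)/(k(k-1)) for m = 3..k."""
    scale = Fraction(2, k * (k - 1))
    return [scale / 2, scale / 2] + [scale * (m - 1) for m in range(3, k + 1)]

weights_6 = greedy_weights(6)   # 1/30, 1/30, 4/30, 6/30, 8/30, 10/30 (in lowest terms)
```

The weights grow linearly in the step index, so the most recently added component always carries the largest share.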

We  now  prove  Corollary  2.1: 

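Proof. Since we consider bounded densities $a \le \phi_\theta \le b$, condition (4) implies that

\[ \forall x, \quad \log\Big(\frac{\phi_\theta(x) - \phi_{\theta'}(x)}{b} + 1\Big) \le B\,|\theta - \theta'|_{L_1}. \]

This allows us to bound $L_\infty$ distances between functions in $\mathcal{H}$ in terms of the $L_1$ distances between
the corresponding parameters. Since $\Theta$ is a $d$-dimensional cube of side-length $A$, we can cover
$\Theta$ by $(A/\delta)^d$ "balls" of $L_1$-radius $d\delta/2$. This cover induces a cover of $\mathcal{H}$. For any $\phi_\theta$ there exists an
element of the cover $\phi_{\theta'}$, so that

\[ d_x(\phi_\theta, \phi_{\theta'}) \le |\phi_\theta - \phi_{\theta'}|_\infty \le b\,e^{Bd\delta/2} - b = \epsilon. \]

Therefore, $\delta = \frac{2\log(\epsilon/b + 1)}{Bd}$ and the cardinality of the cover is $\big(\frac{A}{\delta}\big)^d = \Big(\frac{ABd}{2\log(\epsilon/b + 1)}\Big)^d$. So

\[ \int_0^b \log^{1/2}\mathcal{D}(\mathcal{H},\epsilon,d_x)\,d\epsilon \le \int_0^b \sqrt{d\,\log\frac{ABd}{2\log(\epsilon/b + 1)}}\;d\epsilon. \]

A straightforward calculation shows that the integral above converges.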


□ 
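
The convergence of the entropy integral can also be seen numerically. The following sketch evaluates $\int_0^b \sqrt{d\,\log\big(ABd / (2\log(\epsilon/b + 1))\big)}\,d\epsilon$ with a midpoint rule at two resolutions. The parameter values are hypothetical and chosen so that the logarithm stays positive on $(0, b]$; the function names are invented for the example.

```python
import math

def entropy_integrand(eps, A, B, d, b):
    """sqrt(d * log(A*B*d / (2*log(eps/b + 1)))): square root of the log-cardinality of the cover."""
    return math.sqrt(d * math.log(A * B * d / (2.0 * math.log1p(eps / b))))

def entropy_integral(A, B, d, b, steps):
    """Midpoint rule on (0, b]; the midpoints avoid the integrable singularity at eps = 0."""
    width = b / steps
    return sum(entropy_integrand((j + 0.5) * width, A, B, d, b) for j in range(steps)) * width

# Hypothetical parameter values with A*B*d > 2*log(2), so the integrand is real on (0, b].
A, B, d, b = 2.0, 1.0, 2, 1.5
coarse = entropy_integral(A, B, d, b, steps=2_000)
fine = entropy_integral(A, B, d, b, steps=200_000)
```

The two approximations agree closely even though the integrand blows up like $\sqrt{\log(1/\epsilon)}$ near zero, consistent with a finite value that does not depend on $n$.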


5  Future  Work 

The  main  drawback  of  the  approach  described  in  this  paper  is  the  need  to  lower-bound  the 
densities.  Future  work  will  focus  on  ways  to  remove  this  condition  by  using,  for  instance,  a 
truncation  argument. 

A  Appendix 

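We will denote $f_i = f(x_i)$. The following inequality can be found in [5], Theorem 4.12.

Lemma A.1 ([5] Comparison inequality for Rademacher processes). If $G : \mathbb{R} \to \mathbb{R}$ is convex
and non-decreasing and $\phi_i : \mathbb{R} \to \mathbb{R}$ ($i = 1, \ldots, n$) are contractions ($\phi_i(0) = 0$ and
$|\phi_i(s) - \phi_i(t)| \le |s - t|$), then

\[ \mathbb{E}_\epsilon\, G\Big(\sup_{f\in\mathcal{F}} \sum_{i=1}^{n}\epsilon_i\phi_i(f_i)\Big) \le \mathbb{E}_\epsilon\, G\Big(\sup_{f\in\mathcal{F}} \sum_{i=1}^{n}\epsilon_i f_i\Big). \]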


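Lemma A.2 ([12] Symmetrization). Consider the following processes:

\[ Z(x) = \sup_{f\in\mathcal{F}} \Big|\mathbb{E} f - \frac{1}{n}\sum_{i=1}^{n} f(x_i)\Big|, \qquad R(x) = \sup_{f\in\mathcal{F}} \Big|\frac{1}{n}\sum_{i=1}^{n}\epsilon_i f(x_i)\Big|. \]

Then

\[ \mathbb{E} Z(x) \le 2\,\mathbb{E} R(x). \]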


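Lemma A.3 (Application of McDiarmid inequality). For

\[ Z(x_1, \ldots, x_n) = \sup_{h\in\mathcal{F}} \Big|\mathbb{E}\log\frac{h}{f} - \frac{1}{n}\sum_{i=1}^{n}\log\frac{h(x_i)}{f(x_i)}\Big| \]

the following holds with probability at least $1 - e^{-t}$:

\[ Z - \mathbb{E} Z \le c\,\sqrt{\frac{t}{n}}, \]

where $a$ and $b$ are the lower and upper bounds for $f$ and $h$, and $c = 2\sqrt{2}\,\log\frac{b}{a}$.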

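Proof. Let $t_i = \log\frac{h(x_i)}{f(x_i)}$ and $t_i' = \log\frac{h(x_i')}{f(x_i')}$. The bound on the martingale difference follows:

\[
\begin{aligned}
|Z(x_1, \ldots, x_i', \ldots, x_n) - Z(x_1, \ldots, x_i, \ldots, x_n)|
&= \Big|\sup_{h\in\mathcal{F}} \Big|\mathbb{E}\log\frac{h}{f} - \frac{1}{n}(t_1 + \ldots + t_i' + \ldots + t_n)\Big| \\
&\qquad - \sup_{h\in\mathcal{F}} \Big|\mathbb{E}\log\frac{h}{f} - \frac{1}{n}(t_1 + \ldots + t_i + \ldots + t_n)\Big|\Big| \\
&\le \sup_{h\in\mathcal{F}} \frac{1}{n}\Big|\log\frac{h(x_i')}{f(x_i')} - \log\frac{h(x_i)}{f(x_i)}\Big| \\
&\le \frac{1}{n}\Big(\log\frac{b}{a} - \log\frac{a}{b}\Big) = \frac{2}{n}\log\frac{b}{a} = c_i.
\end{aligned}
\]

The above chain of inequalities holds because of the triangle inequality and properties of sup.
Applying McDiarmid's inequality,

\[ \mathbb{P}(Z - \mathbb{E} Z > u) \le \exp\Big(-\frac{u^2}{2\sum_{i} c_i^2}\Big) = \exp\Big(-\frac{n u^2}{8\log^2\frac{b}{a}}\Big). \]

Equivalently,

\[ \mathbb{P}\Big(Z - \mathbb{E} Z > c\,\sqrt{\frac{t}{n}}\Big) \le e^{-t} \]

for constant $c = 2\sqrt{2}\,\log\frac{b}{a}$.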


□ 
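
The arithmetic linking the bounded differences $c_i = \frac{2}{n}\log\frac{b}{a}$, the exponential tail and the constant $c = 2\sqrt{2}\log\frac{b}{a}$ can be checked directly. The sketch below uses hypothetical values of $a$, $b$, $n$ and $t$, and the form $\exp(-u^2 / (2\sum_i c_i^2))$ of the inequality as it is used in the proof above; the function names are illustrative.

```python
import math

def bounded_difference(a, b, n):
    """c_i = (2/n) * log(b/a): the most Z can change when one sample point is replaced."""
    return 2.0 * math.log(b / a) / n

def tail_bound(u, a, b, n):
    """exp(-u^2 / (2 * sum_i c_i^2)), the form of McDiarmid's inequality used in the proof."""
    sum_sq = n * bounded_difference(a, b, n) ** 2
    return math.exp(-u * u / (2.0 * sum_sq))

a, b, n, t = 0.5, 2.0, 1000, 3.0            # hypothetical values
c = 2.0 * math.sqrt(2.0) * math.log(b / a)  # the constant of Lemma A.3
u = c * math.sqrt(t / n)
bound_at_u = tail_bound(u, a, b, n)         # should coincide with exp(-t)
```

Substituting $u = c\sqrt{t/n}$ turns the exponent into exactly $-t$, which is why the two formulations of the lemma are equivalent.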


B  Example  of  Gaussian  Densities 

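Let $\mathcal{F} = \big\{ f_{\mu,\sigma} : f_{\mu,\sigma}(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\big(-\frac{(x-\mu)^2}{2\sigma^2}\big),\ |\mu| \le M,\ \sigma_{\min} \le \sigma \le \sigma_{\max} \big\}$ be a set of Gaussian
densities defined over a bounded set $X = [-M, M]$ with bounded variance. Here we show that
$\mathcal{F}$ has a finite cover $\mathcal{D}(\mathcal{F}, \epsilon, d_x) = K/\epsilon^2$ for some constant $K$.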
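Define

\[ \mathcal{F}_\mu = \big\{ f_{\mu,\sigma} : f_{\mu,\sigma} \in \mathcal{F},\ \mu \in \{-M + k\epsilon_\mu : k = 0, \ldots, 2M/\epsilon_\mu\} \big\} \]

and

\[ \mathcal{F}_{\mu,\sigma} = \big\{ f_{\mu,\sigma} : f_{\mu,\sigma} \in \mathcal{F}_\mu,\ \sigma \in \{\sigma_{\min} + k\epsilon_\sigma : k = 0, \ldots, (\sigma_{\max} - \sigma_{\min})/\epsilon_\sigma\} \big\}. \]

Thus, $\mathcal{F}_{\mu,\sigma} \subset \mathcal{F}_\mu \subset \mathcal{F}$. We claim that $\mathcal{F}_{\mu,\sigma}$ is a finite $\epsilon$-cover for $\mathcal{F}$ with respect to the $d_x$ norm
(on the data). For any $f_{\mu,\sigma} \in \mathcal{F}$, first choose a function $f_{\mu',\sigma} \in \mathcal{F}_\mu$ so that $|\mu - \mu'| \le \epsilon_\mu$.
Note that functions $f \in \mathcal{F}$ are all Lipschitz because $\sigma$ is bounded. In fact, the largest derivative of
$f_{\mu,\sigma}$ is attained at $x = \mu - \sigma$ and is equal to $\frac{1}{\sqrt{2\pi e}\,\sigma^2}$. So

\[ |f_{\mu,\sigma}(x) - f_{\mu',\sigma}(x)| \le \frac{1}{\sqrt{2\pi e}\,\sigma^2}\,|\mu - \mu'| \le \frac{\epsilon_\mu}{\sqrt{2\pi e}\,\sigma_{\min}^2}. \]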


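Furthermore, any $f_{\mu',\sigma} \in \mathcal{F}_\mu$ can be approximated by $f_{\mu',\sigma'} \in \mathcal{F}_{\mu,\sigma}$ such that $|\sigma - \sigma'| \le \epsilon_\sigma$. Then

\[ \forall x \in X, \quad |f_{\mu',\sigma}(x) - f_{\mu',\sigma'}(x)| \le \frac{1}{\sqrt{2\pi}}\Big|\frac{1}{\sigma} - \frac{1}{\sigma'}\Big| \le \frac{\epsilon_\sigma}{\sqrt{2\pi}\,\sigma_{\min}^2}. \]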


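Combining the two steps, any function in $\mathcal{F}$ can be approximated by a function in $\mathcal{F}_{\mu,\sigma}$ with an
error at most $\frac{\epsilon_\mu + \epsilon_\sigma}{\sqrt{2\pi}\,\sigma_{\min}^2}$. The empirical distance satisfies

\[ d_x(f_{\mu,\sigma}, f_{\mu',\sigma'}) = \Big(\frac{1}{n}\sum_{i=1}^{n}\big(f_{\mu,\sigma}(x_i) - f_{\mu',\sigma'}(x_i)\big)^2\Big)^{1/2} \le \sup_x |f_{\mu,\sigma}(x) - f_{\mu',\sigma'}(x)| \le \frac{\epsilon_\mu + \epsilon_\sigma}{\sqrt{2\pi}\,\sigma_{\min}^2}. \]

Choosing $\epsilon_\mu = \epsilon_\sigma = \sqrt{2\pi}\,\sigma_{\min}^2\,\epsilon/2$ makes the right-hand side equal to $\epsilon$, and we get the size of the cover to be

\[ \mathcal{D}(\mathcal{F}, \epsilon, d_x) = \mathrm{card}(\mathcal{F}_{\mu,\sigma}) = \frac{2M}{\epsilon_\mu}\cdot\frac{\sigma_{\max} - \sigma_{\min}}{\epsilon_\sigma} = \frac{4M(\sigma_{\max} - \sigma_{\min})}{\pi\,\sigma_{\min}^4}\cdot\frac{1}{\epsilon^2}. \]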
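
The grid construction can be tested empirically. The sketch below is an illustration under stated assumptions: it uses hypothetical values of $M$, $\sigma_{\min}$, $\sigma_{\max}$ and $\epsilon$, takes a common grid step $\epsilon\sqrt{2\pi}\,\sigma_{\min}^2/2$ for both parameters, maps randomly drawn Gaussians to a grid point within one step, and measures the largest sup-distance observed on a fine set of evaluation points in $X$. All names are invented for the example.

```python
import math
import random

def gaussian(mu, sigma, x):
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def grid_point(value, start, end, step):
    """A point of {start + k*step : 0 <= k <= (end - start)/step} within one step of value."""
    k_max = int((end - start) // step)
    k = min(round((value - start) / step), k_max)
    return start + k * step

M, sigma_min, sigma_max, eps = 1.0, 0.5, 2.0, 0.05   # hypothetical values
# Common grid step for mu and sigma, so the two approximation errors add up to at most eps.
step = eps * math.sqrt(2.0 * math.pi) * sigma_min ** 2 / 2.0
cover_size = (2.0 * M / step) * ((sigma_max - sigma_min) / step)

random.seed(2)
xs = [-M + 2.0 * M * j / 400 for j in range(401)]    # evaluation points in X = [-M, M]
worst = 0.0
for _ in range(500):
    mu = random.uniform(-M, M)
    sigma = random.uniform(sigma_min, sigma_max)
    mu_c = grid_point(mu, -M, M, step)
    sigma_c = grid_point(sigma, sigma_min, sigma_max, step)
    gap = max(abs(gaussian(mu, sigma, x) - gaussian(mu_c, sigma_c, x)) for x in xs)
    worst = max(worst, gap)
```

With these values `cover_size` is of order $1/\epsilon^2$, and the largest observed gap `worst` stays below $\epsilon$, in line with the claim that the grid is an $\epsilon$-cover.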